Customer Segmentation using Bank Transactions by implementing K-Means Clustering, DBSCAN and Gaussian Mixture Model.¶

Author- Akash Tiwari (229028)

Goal of this Notebook :¶

This notebook segments customers using bank transaction data: 20,000 rows sampled randomly from a dataset of one million (10 lakh) bank transactions. The data is preprocessed, converted into RFM metrics, and then scaled before being fed to three algorithms: K-Means clustering, Gaussian Mixture Model and DBSCAN.

Contents of the Notebook :

  • Importing the Libraries and Loading the data
  • Data Preprocessing
  • RFM(Recency, Frequency and Monetary) Metrics Evaluation
  • Dashboard and Visualization
  • Feature Engineering
  • Principal Component Analysis
  • Implementing the Algorithms
  • Conclusion
  • Difficulties and further improvements

Importing the Libraries and Loading the data¶

In [1]:
#Importing the necessary libraries

import re
import warnings
from datetime import datetime, date
from time import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")
import seaborn as sns

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as sch

try:
    import plotly.express as px
    import plotly.graph_objects as go
except ImportError:
    !pip install plotly
    import plotly.express as px
    import plotly.graph_objects as go

warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

try:
    from kneed import KneeLocator
except ImportError:
    !pip install kneed
    from kneed import KneeLocator
#------------------------------------------------------------------
try:
    from yellowbrick.cluster import KElbowVisualizer
except ImportError:
    !pip install -U yellowbrick
    from yellowbrick.cluster import KElbowVisualizer
In [2]:
#Importing the dataset
df=pd.read_csv('bank_transactions_red.csv')

Data Preprocessing¶

In [3]:
# Printing the first five rows to get the view of the dataset 
print(f'The dataset has {df.shape[0]} records and {df.shape[1]} columns')
df.head()
The dataset has 20000 records and 9 columns
Out[3]:
TransactionID CustomerID CustomerDOB CustGender CustLocation CustAccountBalance TransactionDate TransactionTime TransactionAmount (INR)
0 T892846 C5725279 8/9/84 M JASPUR 84447.82 7/9/16 183210 11858.0
1 T444995 C4588538 2/1/94 M GURGAON 12549.49 16/8/16 161300 250.0
2 T614897 C2416476 14/11/90 M NEW DELHI 33607.65 26/8/16 102007 3360.0
3 T457037 C5940151 15/9/90 M HYDERABAD 38238.86 21/8/16 110438 250.0
4 T997442 C5922989 27/11/88 M PURBO MEDINIPUR 9506.85 14/9/16 90810 33.0

The dataset captures the transactions for a bank in India. It has the following columns :

  • TransactionID : a unique identification number for each transaction.
  • CustomerID : a unique identification number for each customer.
  • CustomerDOB : date of birth of the customer.
  • CustGender : gender of the customer.
  • CustLocation : the location the customer belongs to.
  • CustAccountBalance : the account balance after each transaction.
  • TransactionDate : the date of each transaction.
  • TransactionTime : the time of each transaction.
  • TransactionAmount (INR) : the transacted amount.
In [4]:
# Now we see the data types and the count of non null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   TransactionID            20000 non-null  object 
 1   CustomerID               20000 non-null  object 
 2   CustomerDOB              19936 non-null  object 
 3   CustGender               19980 non-null  object 
 4   CustLocation             19999 non-null  object 
 5   CustAccountBalance       19953 non-null  float64
 6   TransactionDate          20000 non-null  object 
 7   TransactionTime          20000 non-null  int64  
 8   TransactionAmount (INR)  20000 non-null  float64
dtypes: float64(2), int64(1), object(6)
memory usage: 1.4+ MB
In [5]:
## The function defined below records the data type, number of unique values and the number of null values for each column in the dataframe 
def check(df):
    l=[]
    columns=df.columns
    for col in columns:
        dtypes=df[col].dtypes
        nunique=df[col].nunique()
        sum_null=df[col].isnull().sum()
        l.append([col,dtypes,nunique,sum_null])
    df_check=pd.DataFrame(l)
    df_check.columns=['column','dtypes','nunique','sum_null']
    return df_check 
check(df)
Out[5]:
column dtypes nunique sum_null
0 TransactionID object 20000 0
1 CustomerID object 19932 0
2 CustomerDOB object 7742 64
3 CustGender object 2 20
4 CustLocation object 1866 1
5 CustAccountBalance float64 17236 47
6 TransactionDate object 54 0
7 TransactionTime int64 17036 0
8 TransactionAmount (INR) float64 5492 0

From the above dataframe we can see that we have 20,000 unique transactions, with 64 null values in CustomerDOB, 20 in CustGender, 1 in CustLocation and 47 in CustAccountBalance. Since the number of null values is small compared with the total number of records, we can drop the null records from the dataframe.

Also, the datatypes of CustomerDOB and TransactionDate are strings, so for further analysis and calculations these columns need to be converted to the datetime datatype.

Dealing with the null values¶

In [6]:
#Dropping the null values in the original dataframe 
df.dropna(inplace=True)
df.isnull().sum()
Out[6]:
TransactionID              0
CustomerID                 0
CustomerDOB                0
CustGender                 0
CustLocation               0
CustAccountBalance         0
TransactionDate            0
TransactionTime            0
TransactionAmount (INR)    0
dtype: int64

Now the dataframe has no null values, we can proceed with further preprocessing.

Converting the datatype¶

In [7]:
#Converting the datatype of TransactionDate and CustomerDOB from string to Datetime 
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])
df['CustomerDOB'] = pd.to_datetime(df['CustomerDOB'])
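A side note on the conversion above: the raw file stores dates as day-first strings (e.g. 14/11/90), so a bare pd.to_datetime call can silently swap day and month. A minimal sketch of the safer call, on a hypothetical sample mirroring the notebook's format:

```python
import pandas as pd

# Hypothetical sample mirroring the notebook's day-first date strings
sample = pd.DataFrame({
    "TransactionDate": ["7/9/16", "16/8/16", "26/8/16"],
    "CustomerDOB": ["8/9/84", "2/1/94", "14/11/90"],
})

# dayfirst=True tells pandas to read 16/8/16 as 16 August 2016
# rather than misparsing it as a month-first date.
sample["TransactionDate"] = pd.to_datetime(sample["TransactionDate"], dayfirst=True)
sample["CustomerDOB"] = pd.to_datetime(sample["CustomerDOB"], dayfirst=True)
print(sample["TransactionDate"].iloc[1])  # 2016-08-16 00:00:00
```

Whether dayfirst is needed depends on the pandas version and the exact file; it is worth spot-checking a few parsed rows against the raw strings.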

Further Cleaning and feature engineering¶

In [8]:
df['CustomerDOB'].value_counts()
Out[8]:
1800-01-01    1059
1990-01-07      19
1991-01-01      18
1990-10-08      16
1989-01-01      15
              ... 
1995-01-18       1
2068-10-30       1
1983-05-29       1
1997-01-29       1
2071-12-19       1
Name: CustomerDOB, Length: 7726, dtype: int64

We can observe that the DOB 1800-01-01 appears 1059 times, but calculating the age of such customers produces an anomaly: a customer born in 1800 would not be alive today, nor would such a record be relevant to a behavioural study of the current generation. This is why we drop the records having DOB 1800-01-01.

There are also DOBs with a year later than 2023, which is again an anomaly. One option is to drop such records, but this may simply be a recording error, so instead of dropping them we reduce the year by 100 for records with a birth year of 2023 or later.

In [9]:
df = df.drop(df[df['CustomerDOB'] == '1800-01-01'].index,axis = 0)
df.loc[df.CustomerDOB.dt.year >= 2023, 'CustomerDOB'] = df.loc[df.CustomerDOB.dt.year >= 2023, 'CustomerDOB'] - pd.DateOffset(years = 100)
df.head()
Out[9]:
TransactionID CustomerID CustomerDOB CustGender CustLocation CustAccountBalance TransactionDate TransactionTime TransactionAmount (INR)
0 T892846 C5725279 1984-08-09 M JASPUR 84447.82 2016-07-09 183210 11858.0
1 T444995 C4588538 1994-02-01 M GURGAON 12549.49 2016-08-16 161300 250.0
2 T614897 C2416476 1990-11-14 M NEW DELHI 33607.65 2016-08-26 102007 3360.0
3 T457037 C5940151 1990-09-15 M HYDERABAD 38238.86 2016-08-21 110438 250.0
4 T997442 C5922989 1988-11-27 M PURBO MEDINIPUR 9506.85 2016-09-14 90810 33.0
In [10]:
df['CustomerDOB'].value_counts()
Out[10]:
1990-01-07    19
1991-01-01    18
1990-10-08    16
1989-01-01    15
1994-01-01    14
              ..
1995-01-18     1
1968-10-30     1
1983-05-29     1
1997-01-29     1
1971-12-19     1
Name: CustomerDOB, Length: 7725, dtype: int64

All the anomalies in the CustomerDOB column have been treated. We can now calculate the age from this column and add a Customer_age column to the dataframe.

In [11]:
df['Customer_age'] = ((pd.to_datetime('today') - df['CustomerDOB'])/np.timedelta64(1, 'Y')).round(0)
df.head()
Out[11]:
TransactionID CustomerID CustomerDOB CustGender CustLocation CustAccountBalance TransactionDate TransactionTime TransactionAmount (INR) Customer_age
0 T892846 C5725279 1984-08-09 M JASPUR 84447.82 2016-07-09 183210 11858.0 39.0
1 T444995 C4588538 1994-02-01 M GURGAON 12549.49 2016-08-16 161300 250.0 29.0
2 T614897 C2416476 1990-11-14 M NEW DELHI 33607.65 2016-08-26 102007 3360.0 32.0
3 T457037 C5940151 1990-09-15 M HYDERABAD 38238.86 2016-08-21 110438 250.0 32.0
4 T997442 C5922989 1988-11-27 M PURBO MEDINIPUR 9506.85 2016-09-14 90810 33.0 34.0

The Customer_age column has been added to the dataframe
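As an aside, dividing a timedelta by np.timedelta64(1, 'Y') is deprecated in recent NumPy/pandas versions. A hedged alternative sketch that computes whole-year ages from the birth date directly (the reference date 2023-06-01 is an arbitrary assumption for illustration; a fixed date also keeps the result reproducible, unlike pd.to_datetime('today')):

```python
import pandas as pd

# Hypothetical DOBs for illustration
dob = pd.Series(pd.to_datetime(["1984-08-09", "1994-02-01"]))

# Age in whole years: subtract birth year from the reference year, then
# subtract one more if the birthday has not yet occurred in the reference year.
ref = pd.Timestamp("2023-06-01")
age = ref.year - dob.dt.year - (
    (ref.month < dob.dt.month) | ((ref.month == dob.dt.month) & (ref.day < dob.dt.day))
).astype(int)
print(age.tolist())  # [38, 29]
```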

In [12]:
## Checking the duplicate values
df.duplicated().sum()
Out[12]:
0

The dataset has no duplicate values to handle

Outlier Analysis¶

In [13]:
#Storing the numerical and categorical columns in two different variables

num_col = df.select_dtypes(include=np.number)
cat_col = df.select_dtypes(exclude=np.number)
In [14]:
plt.style.use("fivethirtyeight")
plt.figure(figsize=(30,30))
for index,column in enumerate(num_col):
    plt.subplot(7,4,index+1)
    sns.boxplot(data=num_col,x=column)
    
plt.tight_layout(pad = 1.0)

The above boxplots show that CustAccountBalance, TransactionAmount and Customer_age have many outliers. We could remove these outliers or substitute them with values consistent with the distribution of the observations. However, since this is bank transaction data collected over time, I have decided not to manipulate the outliers: these columns can legitimately take widely varying values depending on a customer's purchasing power and financial background, and these boundary values can themselves help in segmenting a common behavioural pattern.

Domain Prerequisite: RFM Metrics Evaluation¶


What is RFM used for?¶

RFM became popular in direct-mail marketing in the 1960s. The technique is still in use today, applied to digital services and used to influence hundreds (and even millions) of eCommerce clients.

The benefits of RFM:

  • Higher efficiency: marketing actions (and their impact) are optimized.
  • Higher conversion rates: CVR increases.
  • Better loyalty-building practices: customer retention improves.
  • More interaction: response rates increase.
  • More revenue: sales are boosted.

How does it work?¶

Here, each of the three variables (Recency, Frequency and Monetary) is split into four equal groups, which creates 64 (4x4x4) different customer segments.

Steps of RFM(Recency, Frequency, Monetary):

  • Calculate the Recency, Frequency, Monetary values for each customer.
  • Add segment bin values to RFM table using quartile.
  • Sort the customer RFM score in ascending order.
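The notebook itself stops at raw Recency, Frequency and Monetary values; the quartile-scoring step listed above could be sketched as follows (a hypothetical mini-table and one common pd.qcut labelling convention, not the notebook's own code):

```python
import pandas as pd

# Hypothetical per-customer RFM values
rfm = pd.DataFrame({
    "Recency":   [1, 29, 5, 120, 60, 2, 15, 90],
    "Frequency": [1, 2, 1, 4, 3, 1, 2, 5],
    "Monetary":  [5000.0, 464.5, 368.0, 150.0, 712.0, 33.0, 250.0, 9000.0],
})

# Quartile-based scores: low recency is good (score 4), high frequency and
# monetary are good (score 4).  rank(method='first') breaks ties so qcut
# always gets unique bin edges.
rfm["R"] = pd.qcut(rfm["Recency"].rank(method="first"), 4, labels=[4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["Frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["Monetary"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
rfm["RFM_score"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
print(rfm.sort_values("RFM_score"))
```

Concatenating the three quartile labels gives the 4x4x4 = 64 possible segments mentioned above.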

For further understanding of RFM below links can be referred

What is RFM and how to calculate them

Implementing RFM using Python for Customer Segmentation

Implementing RFM strategy¶

In [15]:
df['TransactionDate1']=df['TransactionDate'] # ==> to calculate the minimum (first transaction)
df['TransactionDate2']=df['TransactionDate'] # ==> to calculate the maximum (last transaction)

Now, to create an RFM table, the columns of the dataframe need to be aggregated so that we can calculate the Recency, Frequency and Monetary values. The aggregation strategy on the df dataframe is defined in the cell below.

In [16]:
#Creating the RFM table aggregation strategy

RMF_df = df.groupby("CustomerID").agg({
                                        "TransactionID" : "count",
                                        "CustGender" : "first",
                                        "CustLocation":"first",
                                        "CustAccountBalance"  : "mean",
                                        "TransactionAmount (INR)" : "mean",
                                        "Customer_age" : "median",
                                        "TransactionDate2":"max",
                                        "TransactionDate1":"min",
                                        "TransactionDate":"median"
                        })

RMF_df = RMF_df.reset_index()
RMF_df.head()
Out[16]:
CustomerID TransactionID CustGender CustLocation CustAccountBalance TransactionAmount (INR) Customer_age TransactionDate2 TransactionDate1 TransactionDate
0 C1010024 1 M KOLKATA 87058.65 5000.0 58.0 2016-08-18 2016-08-18 2016-08-18 00:00:00
1 C1010243 2 M TEHRI GARHWAL 98513.39 464.5 41.0 2016-03-08 2016-02-08 2016-02-22 12:00:00
2 C1010279 1 F ANDHERI WEST MUMBAI 137.91 368.0 35.0 2016-08-13 2016-08-13 2016-08-13 00:00:00
3 C1010517 1 M SRIPERUMBUDUR 14750.09 150.0 28.0 2016-05-08 2016-05-08 2016-05-08 00:00:00
4 C1010655 1 M MEERUT 45856.24 712.0 32.0 2016-08-24 2016-08-24 2016-08-24 00:00:00

Shape of the newly created RMF dataframe

In [17]:
RMF_df.shape
Out[17]:
(18747, 10)

We drop the CustomerID column from the RMF dataframe as it is not needed for the further analysis.

In [18]:
RMF_df.drop(columns=["CustomerID"],inplace=True)

The strategy we applied to the TransactionID column in the RMF table is simply a count of the number of transactions for each customer, which is exactly the frequency. This is why the column is renamed to Frequency in the next cell.

The average TransactionAmount is the monetary value for each customer, so we rename that column as well.

In [19]:
#Defining the Frequency 
RMF_df.rename(columns={"TransactionID":"Frequency","TransactionAmount (INR)":"Monetary"},inplace=True)

Recency is the time elapsed between a customer's transactions, and we will calculate it as the number of days between the latest and the earliest transaction. We implement this using the two columns TransactionDate1 (first transaction) and TransactionDate2 (last transaction).

In [20]:
#Defining the Recency
RMF_df['Recency']=RMF_df['TransactionDate2']-RMF_df['TransactionDate1']
RMF_df['Recency']=RMF_df['Recency'].astype(str)
RMF_df['Recency']=RMF_df['Recency'].apply(lambda x: re.search(r'\d+', x).group())
RMF_df['Recency']=RMF_df['Recency'].astype(int)
# A gap of 0 days means the customer transacted only once (or all on one day), so we map 0 to 1
def rep_0(i):
    if i==0:
        return 1
    else:
        return i
RMF_df['Recency']=RMF_df['Recency'].apply(rep_0)
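As a side note, the regex round-trip above can be avoided: subtracting two datetime columns yields a timedelta whose .dt.days accessor gives the day count directly. A minimal sketch with hypothetical dates, applying the same 0-to-1 convention via clip:

```python
import pandas as pd

# Hypothetical first/last transaction dates per customer
first = pd.Series(pd.to_datetime(["2016-08-18", "2016-02-08"]))
last  = pd.Series(pd.to_datetime(["2016-08-18", "2016-03-08"]))

# Timedelta subtraction gives whole days directly; clip(lower=1) applies the
# notebook's convention of mapping 0-day gaps to 1.
recency = (last - first).dt.days.clip(lower=1)
print(recency.tolist())  # [1, 29]
```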
In [21]:
RMF_df.head()
Out[21]:
Frequency CustGender CustLocation CustAccountBalance Monetary Customer_age TransactionDate2 TransactionDate1 TransactionDate Recency
0 1 M KOLKATA 87058.65 5000.0 58.0 2016-08-18 2016-08-18 2016-08-18 00:00:00 1
1 2 M TEHRI GARHWAL 98513.39 464.5 41.0 2016-03-08 2016-02-08 2016-02-22 12:00:00 29
2 1 F ANDHERI WEST MUMBAI 137.91 368.0 35.0 2016-08-13 2016-08-13 2016-08-13 00:00:00 1
3 1 M SRIPERUMBUDUR 14750.09 150.0 28.0 2016-05-08 2016-05-08 2016-05-08 00:00:00 1
4 1 M MEERUT 45856.24 712.0 32.0 2016-08-24 2016-08-24 2016-08-24 00:00:00 1

The columns TransactionDate1 and TransactionDate2 will be dropped, as they were only needed to calculate the recency, which we have now computed.

In [22]:
RMF_df.drop(columns=["TransactionDate1","TransactionDate2"],inplace=True)

Now that we have the RMF table created we will look for any potential outliers in the data

In [23]:
# calculate the outliers for each feature
lower_list=[]
upper_list=[]
num_list=[]
perc_list=[]
cols=['Frequency', 'CustAccountBalance','Monetary', 'Customer_age', 'Recency']
for i in cols:
    Q1 = RMF_df[i].quantile(0.25)
    Q3 = RMF_df[i].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # calculate number of outliers
    num=RMF_df[(RMF_df[i] < lower) | (RMF_df[i] > upper)].shape[0]
    # calculate percentage of outliers
    perc = (num / RMF_df.shape[0]) * 100
    lower_list.append(lower)
    upper_list.append(upper)
    num_list.append(num)
    perc_list.append(round(perc,2))

    
dic={'lower': lower_list, 'upper': upper_list, 'outliers': num_list, 'Perc%':perc_list }
outliers_df=pd.DataFrame(dic,index=['Frequency', 'CustAccountBalance','Monetary', 'Customer_age', 'Recency'])
outliers_df
Out[23]:
lower upper outliers Perc%
Frequency 1.0000 1.0000 62 0.33
CustAccountBalance -68211.9425 125958.9175 2491 13.29
Monetary -1276.5000 2527.5000 2057 10.97
Customer_age 18.5000 54.5000 1008 5.38
Recency 1.0000 1.0000 57 0.30

As stated previously, the outliers present are not removed, as they can themselves be used to study the behavioural patterns of such clustered values.

In [24]:
RMF_df['TransactionDay'] = RMF_df['TransactionDate'].dt.dayofweek
RMF_df['TransactionMonth'] = RMF_df['TransactionDate'].dt.month
RMF_df.head()
Out[24]:
Frequency CustGender CustLocation CustAccountBalance Monetary Customer_age TransactionDate Recency TransactionDay TransactionMonth
0 1 M KOLKATA 87058.65 5000.0 58.0 2016-08-18 00:00:00 1 3 8
1 2 M TEHRI GARHWAL 98513.39 464.5 41.0 2016-02-22 12:00:00 29 0 2
2 1 F ANDHERI WEST MUMBAI 137.91 368.0 35.0 2016-08-13 00:00:00 1 5 8
3 1 M SRIPERUMBUDUR 14750.09 150.0 28.0 2016-05-08 00:00:00 1 6 5
4 1 M MEERUT 45856.24 712.0 32.0 2016-08-24 00:00:00 1 2 8
In [25]:
### Final RMF Dataframe
RMF_df.head()
Out[25]:
Frequency CustGender CustLocation CustAccountBalance Monetary Customer_age TransactionDate Recency TransactionDay TransactionMonth
0 1 M KOLKATA 87058.65 5000.0 58.0 2016-08-18 00:00:00 1 3 8
1 2 M TEHRI GARHWAL 98513.39 464.5 41.0 2016-02-22 12:00:00 29 0 2
2 1 F ANDHERI WEST MUMBAI 137.91 368.0 35.0 2016-08-13 00:00:00 1 5 8
3 1 M SRIPERUMBUDUR 14750.09 150.0 28.0 2016-05-08 00:00:00 1 6 5
4 1 M MEERUT 45856.24 712.0 32.0 2016-08-24 00:00:00 1 2 8

Dashboard and Visualization¶

In [26]:
# correlation between features
plt.figure(figsize=(7,5))
correlation=RMF_df.corr()
sns.heatmap(correlation,vmin=None,
    vmax=0.8,
    cmap='rocket_r',
    annot=True,
    fmt='.1f',
    linecolor='white',
    cbar=True);

The above heatmap shows that we do not have to worry about multicollinearity, as none of the columns are strongly correlated.

In [27]:
plt.style.use("fivethirtyeight")
chart=sns.countplot(x='Frequency',data=RMF_df,palette='rocket', order = RMF_df['Frequency'].value_counts().index)
plt.title("Frequency",
          fontsize='20',
          backgroundcolor='AliceBlue',
          color='magenta');

This shows that the transaction frequency for almost every customer is 1; that is, most customers have a single transaction recorded for them.

In [28]:
plt.hist(RMF_df['TransactionMonth'], color='yellow', edgecolor='black',bins = 10, linewidth=2)
plt.title('Transactions in each month')
plt.show()

The graph shows that the number of transactions spiked in the month of August.

In [29]:
plt.hist(RMF_df['TransactionDay'], color='yellow', edgecolor='black',bins = 10, linewidth=2)
plt.title('Transactions in each day of the week')
plt.show()

The transactions are uniformly distributed throughout the days of a week

In [30]:
plt.style.use("fivethirtyeight")
fig,ax=plt.subplots(ncols=2,nrows=1,figsize=(15,5))
palette_color = sns.color_palette('rocket')
ax[0].hist(x=RMF_df['Customer_age'],color='purple')
ax[0].set_title("Distribution of Customer Age")
ax[1].pie(RMF_df['CustGender'].value_counts(),autopct='%1.f%%',colors=palette_color,labels=['Male','Female'])
ax[1].set_title("Customer Gender")
plt.tight_layout();

The distribution of age is positively skewed, with the 30-40 age group having the highest frequency.

The customer records are dominated by males at 73%, with only 27% female records.

In [31]:
plt.style.use("fivethirtyeight")
plt.figure(figsize=(15,7))
chart=sns.countplot(y='CustLocation',data=RMF_df,palette='rocket', order = RMF_df['CustLocation'].value_counts()[:20].index)
plt.title("Most 20 Location of Customer ",
          fontsize='20',
          backgroundcolor='AliceBlue',
          color='magenta');

This shows that the recorded transactions mostly come from customers in Mumbai, Bangalore, New Delhi, Gurgaon, Delhi, Noida, Chennai and Pune, all major metropolitan areas.

In [32]:
plt.style.use("fivethirtyeight")
sns.scatterplot(x='Monetary',y='CustAccountBalance',data=RMF_df,palette='rocket',hue='Frequency',size='Recency' )
plt.title("Monetary and CustAccountBalance",
          fontsize='20',
          backgroundcolor='AliceBlue',
          color='magenta');

This is one of the most important plots: it shows the spread of the data with respect to Monetary and CustAccountBalance. Frequency is shown using different colours and Recency is depicted using the size of the data points.

As the Frequency is 1 for the majority of customers, most data points have the same colour, and the Recency is likewise mostly 1.

The majority behaviour corresponds to customers with an account balance between 0 and 1,000,000 and a monetary value roughly between 0 and 8,000.

Month wise distribution of the values for each column¶

In [33]:
# Sorting and grouping the columns monthwise to study the average behaviour overtime
RMF_df=RMF_df.sort_values(by='TransactionDate')
groupbby_month=RMF_df.groupby([pd.Grouper(key='TransactionDate',freq='M')]).mean()
print(groupbby_month.shape)
groupbby_month
(12, 7)
Out[33]:
Frequency CustAccountBalance Monetary Customer_age Recency TransactionDay TransactionMonth
TransactionDate
2016-01-31 1.000000 69906.068790 1400.891081 37.571429 1.000000 4.477477 1.0
2016-02-29 1.002656 82857.049880 1751.908858 37.745020 1.158035 0.543161 2.0
2016-03-31 1.000000 90811.376536 1339.061988 37.467857 1.000000 1.588095 3.0
2016-04-30 1.000000 99780.835910 1229.883830 37.484634 1.000000 4.578014 4.0
2016-05-31 1.002660 77862.819568 1421.804428 37.194814 1.523936 2.977394 5.0
2016-06-30 1.015421 79584.377165 1854.192491 37.495848 3.112693 2.435350 6.0
2016-07-31 1.009259 90065.473131 1601.971701 37.425347 2.035880 4.417824 7.0
2016-08-31 1.002204 99542.131787 1479.473098 37.877723 1.060296 2.857365 8.0
2016-09-30 1.004923 85972.026034 1310.581846 37.740155 1.182841 2.909634 9.0
2016-10-31 1.005501 94627.117992 1747.632068 38.239274 1.430143 5.418042 10.0
2016-11-30 1.001164 78637.287247 1206.780780 37.460419 1.000000 1.551804 11.0
2016-12-31 1.001337 101923.685107 1519.361190 38.229278 1.000000 3.466578 12.0

The average Frequency remains roughly constant throughout, but the average Monetary value was highest in June, and so was the average Recency.

In [34]:
plt.style.use("fivethirtyeight")
fig,ax=plt.subplots(ncols=2,nrows=1,figsize=(15,5))
ax[0].plot(groupbby_month.index,groupbby_month['CustAccountBalance'],color='purple',marker='o',label='Customer Account Balance')
ax[0].set_title('Cust Account Balance Over The Time')
ax[1].plot(groupbby_month.index,groupbby_month['Monetary'],color='purple',marker='o',label='Monetary')
ax[1].set_title("Monetary Over The Time")
plt.legend();

The plots show the behaviour of the average customer account balance and the average monetary value over the different months.

As we would expect, when the recency is low the account balance tends to be high, and the monthly averages in the table above are broadly consistent with this.

Feature Engineering using RFM¶

In [35]:
RMF_df=RMF_df.reset_index(drop=True)
RMF_df.head()
Out[35]:
Frequency CustGender CustLocation CustAccountBalance Monetary Customer_age TransactionDate Recency TransactionDay TransactionMonth
0 1 M BHOPAL 33206.65 300.00 33.0 2016-01-08 1 4 1
1 1 M BANGALORE 12819.30 20.00 38.0 2016-01-08 1 4 1
2 1 M THANE 89365.55 1484.52 46.0 2016-01-08 1 4 1
3 1 M KOTA 13956.38 1150.00 42.0 2016-01-08 1 4 1
4 1 F SALEM 22091.49 500.00 39.0 2016-01-08 1 4 1
In [36]:
lab = preprocessing.LabelEncoder()
RMF_df['CustGender']= lab.fit_transform(RMF_df['CustGender'])
RMF_df['CustLocation']= lab.fit_transform(RMF_df['CustLocation'])
RMF_df.head()
Out[36]:
Frequency CustGender CustLocation CustAccountBalance Monetary Customer_age TransactionDate Recency TransactionDay TransactionMonth
0 1 1 225 33206.65 300.00 33.0 2016-01-08 1 4 1
1 1 1 150 12819.30 20.00 38.0 2016-01-08 1 4 1
2 1 1 1538 89365.55 1484.52 46.0 2016-01-08 1 4 1
3 1 1 808 13956.38 1150.00 42.0 2016-01-08 1 4 1
4 1 0 1337 22091.49 500.00 39.0 2016-01-08 1 4 1
In [37]:
RMF_df.drop(['TransactionDate','TransactionDay','TransactionMonth'],axis=1,inplace=True)
RMF_df.head()
Out[37]:
Frequency CustGender CustLocation CustAccountBalance Monetary Customer_age Recency
0 1 1 225 33206.65 300.00 33.0 1
1 1 1 150 12819.30 20.00 38.0 1
2 1 1 1538 89365.55 1484.52 46.0 1
3 1 1 808 13956.38 1150.00 42.0 1
4 1 0 1337 22091.49 500.00 39.0 1
In [38]:
# data scaling
df_scaled=StandardScaler().fit_transform(RMF_df)
df_scaled=pd.DataFrame(df_scaled,columns=RMF_df.columns)
df_scaled.head()
Out[38]:
Frequency CustGender CustLocation CustAccountBalance Monetary Customer_age Recency
0 -0.057165 0.608518 -1.216462 -0.159456 -0.221783 -0.543911 -0.042985
1 -0.057165 0.608518 -1.389119 -0.214844 -0.275029 0.028657 -0.042985
2 -0.057165 0.608518 1.806193 -0.006884 0.003472 0.944767 -0.042985
3 -0.057165 0.608518 0.125661 -0.211755 -0.060142 0.486712 -0.042985
4 -0.057165 -1.643336 1.343471 -0.189653 -0.183750 0.143171 -0.042985

Implementing PCA¶

The intention behind applying PCA is to obtain the components along which most of the variation in the data is explained. This will help with the subsequent clustering.

In [39]:
pca = PCA()
pca_data = pca.fit_transform(df_scaled.iloc[:,:-1])
In [40]:
pca.explained_variance_ratio_
Out[40]:
array([0.20695732, 0.17047326, 0.1665273 , 0.16488166, 0.15265303,
       0.13850744])
In [41]:
plt.plot(list(range(1,7)),pca.explained_variance_ratio_)
plt.axis([0,7,0,max(pca.explained_variance_ratio_)+0.05])
plt.xticks(list(range(1,7)))
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.show()
In [42]:
plt.plot(list(range(1,7)),np.cumsum(pca.explained_variance_ratio_))
plt.axis([0,7,0,1.1])
plt.axhline(y=0.8,color='r',linestyle='--',linewidth=1)
plt.xticks(list(range(1,7)))
plt.xlabel('# of Principal Components')
plt.ylabel('Cumulative Variance Explained')
plt.show()

From the above analysis we conclude that 5 principal components are enough to explain about 86% of the variation in the data, which is why we fit and evaluate 5 principal components for the further analysis.

In [43]:
pca = PCA(n_components=5)
pca_data = pca.fit_transform(df_scaled.iloc[:,:-1])

Implementing Clustering Algorithms¶

K-mean¶

K-means clustering is a method of vector quantization, originally from signal processing, that partitions observations into k clusters in which each observation belongs to the cluster with the nearest mean (centroid).

The elbow method is a graphical way of finding the optimal k for K-means clustering.

The silhouette coefficient, or silhouette score, is a metric used to evaluate the goodness of a clustering. Its value ranges from -1 to 1; a value near 1 means the clusters are well separated and clearly distinguished.

A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly produced as the output of hierarchical clustering.
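Besides the elbow visualizer used below, the silhouette score can also be scanned directly over a range of k. A minimal sketch on synthetic blobs (a stand-in for the PCA-reduced data, not the notebook's actual input):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-reduced data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Fit K-means for a range of k and keep the k with the best silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```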

In [44]:
import random
from yellowbrick.cluster import KElbowVisualizer
model = KMeans(init = 'k-means++', random_state = 42)

visualizer = KElbowVisualizer(model, k=(2,30), timings=False)
visualizer2 = KElbowVisualizer(model, k=(2,30), timings=False)
In [45]:
visualizer.fit(pca_data)
visualizer.show()
Out[45]:
<AxesSubplot: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

As per the elbow plot, 8 clusters would be a fair choice for segmenting the customers.

In [46]:
plt.figure(figsize=(30,10))
z = sch.linkage(pca_data[:350,:], 'ward')
sch.dendrogram(z);

The dendrogram also suggests that cutting the tree at height 5, which yields 8 clusters, is a reasonable choice for clustering the given data.

In [47]:
kmeans_set = {"init":"random", "max_iter":300, "random_state":42}
PCA_kmeans = KMeans(n_clusters=8 , **kmeans_set)    
PCA_kmeans.fit(pca_data)
df_scaled['label'] = PCA_kmeans.labels_
ss_k=silhouette_score(pca_data,PCA_kmeans.labels_)
df_scaled.head()
Out[47]:
Frequency CustGender CustLocation CustAccountBalance Monetary Customer_age Recency label
0 -0.057165 0.608518 -1.216462 -0.159456 -0.221783 -0.543911 -0.042985 1
1 -0.057165 0.608518 -1.389119 -0.214844 -0.275029 0.028657 -0.042985 1
2 -0.057165 0.608518 1.806193 -0.006884 0.003472 0.944767 -0.042985 4
3 -0.057165 0.608518 0.125661 -0.211755 -0.060142 0.486712 -0.042985 4
4 -0.057165 -1.643336 1.343471 -0.189653 -0.183750 0.143171 -0.042985 2
In [48]:
df_scaled.label.value_counts()
Out[48]:
1    6310
4    5999
6    2521
2    2306
5    1424
3      75
0      62
7      50
Name: label, dtype: int64

As clusters 0 and 3 have few representatives, they are excluded from the radar plots below.

In [49]:
attributes =df_scaled.columns[0:7]
color={2:'purple',4:'yellow',1:'red',5:'blue',6:'orange',7:'green'}


for index in color:
    fig = go.Figure()
    fig.add_trace(go.Scatterpolar(r=np.mean(df_scaled.loc[df_scaled.label == index, attributes]),
                                  theta=attributes,
                                  fill='toself',
                                  fillcolor=color[index],
                                  opacity=0.4,
                                  name='cluster'+str(index)))
    fig.update_layout(polar=dict(radialaxis=dict(visible=True)),showlegend=True,
                  title="Radar plot - K-Means Cluster"+str(index))
    fig.show()

From the above plots we can interpret the following:

  • Customers in cluster 6 have higher account balances and monetary values, as well as higher age and recency, so the bank could target this cluster for loans and other benefits.

  • Cluster 2, on the other hand, has younger customers with low monetary values and account balances; the recency in this cluster is moderate.

  • Cluster 5 has the oldest customers in the dataset, with a very low recency value; that is, they do not transact frequently.

Gaussian Mixture Model¶

Gaussian mixture models are a probabilistic model for representing normally distributed subpopulations within an overall population. Mixture models in general don't require knowing which subpopulation a data point belongs to, allowing the model to learn the subpopulations automatically. Since subpopulation assignment is not known, this constitutes a form of unsupervised learning.
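The choice of two components in the next cell can also be motivated with an information criterion such as BIC, where the component count with the lowest BIC is preferred. A hedged sketch on synthetic data (not the selection procedure used in this notebook):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the PCA-reduced features
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.2, random_state=0)

# Fit GMMs for several component counts and compare BIC (lower is better).
bics = {}
for n in range(1, 7):
    gmm = GaussianMixture(n_components=n, random_state=0).fit(X)
    bics[n] = gmm.bic(X)

best_n = min(bics, key=bics.get)
print(best_n)
```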

In [50]:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2).fit(pca_data)
labels = gmm.predict(pca_data)
In [51]:
df_scaled['label_gm'] = labels
ss_g=silhouette_score(pca_data,labels)
df_scaled.head()
Out[51]:
Frequency CustGender CustLocation CustAccountBalance Monetary Customer_age Recency label label_gm
0 -0.057165 0.608518 -1.216462 -0.159456 -0.221783 -0.543911 -0.042985 1 1
1 -0.057165 0.608518 -1.389119 -0.214844 -0.275029 0.028657 -0.042985 1 1
2 -0.057165 0.608518 1.806193 -0.006884 0.003472 0.944767 -0.042985 4 1
3 -0.057165 0.608518 0.125661 -0.211755 -0.060142 0.486712 -0.042985 4 1
4 -0.057165 -1.643336 1.343471 -0.189653 -0.183750 0.143171 -0.042985 2 1
In [52]:
df_scaled.label_gm.value_counts()
Out[52]:
1    16472
0     2275
Name: label_gm, dtype: int64
In [53]:
attributes =df_scaled.columns[0:7]
color={0:'purple',1:'yellow'}


for index in color:
    fig = go.Figure()
    fig.add_trace(go.Scatterpolar(r=np.mean(df_scaled.loc[df_scaled.label_gm == index, attributes]),
                                  theta=attributes,
                                  fill='toself',
                                  fillcolor=color[index],
                                  opacity=0.4,
                                  name='cluster'+str(index)))
    fig.update_layout(polar=dict(radialaxis=dict(visible=True)),showlegend=True,
                  title="Radar plot -Gaussian Mixture Model Cluster"+str(index))
    fig.show()

The Gaussian Mixture Model gives its best silhouette score with two clusters:

  • Cluster 0 has customers with greater account balance and age, but low recency.
  • Cluster 1 has customers with higher recency but lower age and account balance.

After K-Means, the Gaussian Mixture Model has given two more clusters that can be used for studying customer behaviour.

DBSCAN¶

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).

Why do we need DBSCAN?

K-Means and hierarchical clustering both fail to create clusters of arbitrary shapes, and they cannot form clusters based on varying densities. That is why we need DBSCAN clustering.

The parameters involved are eps (the radius of the neighbourhood around each point) and min_samples (the minimum number of points required within that radius to form a dense region).
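The claim about arbitrary shapes can be checked on a toy dataset: two interleaving half-moons that K-Means cuts in half but DBSCAN separates cleanly. The eps and min_samples values here are illustrative choices tuned for this toy data, not for the bank dataset:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving, non-convex clusters
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN recovers the moons almost perfectly; K-Means does not
print(adjusted_rand_score(y_true, km_labels))  # well below 1
print(adjusted_rand_score(y_true, db_labels))  # close to 1
```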

Finding the optimal value of epsilon using nearest neighbour¶

In [54]:
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt

# k-distance plot: sort every point's distance to a near neighbour;
# the "elbow" of the curve suggests a good eps
neighbors = NearestNeighbors(n_neighbors=100)
neighbors_fit = neighbors.fit(pca_data)
distances, indices = neighbors_fit.kneighbors(pca_data)
distances = np.sort(distances, axis=0)
# Note: column 0 is each point's distance to itself, so column 1 is the
# 1st-neighbour distance. With min_samples=100 the conventional choice is
# the 100th-neighbour distance (distances[:, -1]), which would typically
# suggest a somewhat larger eps.
distances = distances[:,1]
plt.plot(distances)
Out[54]:
[<matplotlib.lines.Line2D at 0x13eb8a61ea0>]

From the graph we can observe that the distance changes drastically at around 0.8. Beyond this distance the density of points drops off, so we can use an epsilon of 0.8 to form the clusters.

In [55]:
dbscan_cluster_x = DBSCAN(eps=0.8, min_samples=100)
dbscan_cluster_x.fit(pca_data)
Out[55]:
DBSCAN(eps=0.8, min_samples=100)
In [56]:
labels=dbscan_cluster_x.labels_
df_scaled['label_db'] = labels
ss_d=silhouette_score(pca_data,labels)
df_scaled.head()
Out[56]:
Frequency CustGender CustLocation CustAccountBalance Monetary Customer_age Recency label label_gm label_db
0 -0.057165 0.608518 -1.216462 -0.159456 -0.221783 -0.543911 -0.042985 1 1 0
1 -0.057165 0.608518 -1.389119 -0.214844 -0.275029 0.028657 -0.042985 1 1 0
2 -0.057165 0.608518 1.806193 -0.006884 0.003472 0.944767 -0.042985 4 1 0
3 -0.057165 0.608518 0.125661 -0.211755 -0.060142 0.486712 -0.042985 4 1 0
4 -0.057165 -1.643336 1.343471 -0.189653 -0.183750 0.143171 -0.042985 2 1 1
In [57]:
df_scaled.label_db.value_counts()
Out[57]:
 0    13174
 1     4800
-1      773
Name: label_db, dtype: int64
In [58]:
attributes =df_scaled.columns[0:7]
color={0:'purple',1:'yellow',-1:'blue'}


for index in color:
    fig = go.Figure()
    fig.add_trace(go.Scatterpolar(r=np.mean(df_scaled.loc[df_scaled.label_db == index, attributes]),
                                  theta=attributes,
                                  fill='toself',
                                  fillcolor=color[index],
                                  opacity=0.4,
                                  name='cluster'+str(index)))
    fig.update_layout(polar=dict(radialaxis=dict(visible=True)),showlegend=True,
                  title="Radar plot -DBSCAN Cluster"+str(index))
    fig.show()

The above plots show three groups:

  • Cluster 1 of DBSCAN corresponds to cluster 6 of K-Means, with high account balance, recency and age.
  • Cluster -1 is DBSCAN's noise label rather than a true cluster; compared with cluster 2 of K-Means, it isolates the points with lower age, account balance and recency.

In conclusion, K-Means gave 8 clusters, of which 2 had well-defined behaviours; DBSCAN has produced 3 groups, two of which correspond to those 2 well-defined K-Means clusters.
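One caveat when scoring DBSCAN with the silhouette: in scikit-learn, points labelled -1 are noise, not a cluster, so including them can distort the comparison with K-Means and the GMM. A hedged sketch of scoring only the core clusters, with synthetic blobs standing in for pca_data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for pca_data
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

mask = labels != -1                                   # drop the noise points
score_all = silhouette_score(X, labels)               # noise treated as a "cluster"
score_core = silhouette_score(X[mask], labels[mask])  # core clusters only
print(score_all, score_core)
```

Re-running the notebook's comparison with the noise points excluded would give a fairer picture of how DBSCAN's clusters compare with the other two algorithms.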

In [59]:
# Comparing the silhouette scores
compare=pd.DataFrame({'Algorithm':['K-means','Gaussian Mixture','DBSCAN'],'Silhouette Score':[ss_k,ss_g,ss_d]})
compare
Out[59]:
Algorithm Silhouette Score
0 K-means 0.429659
1 Gaussian Mixture 0.404078
2 DBSCAN 0.416331
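The silhouette is only one internal validity index; scikit-learn also provides the Calinski-Harabasz and Davies-Bouldin scores, which can corroborate (or contradict) a silhouette-based ranking. A hedged sketch on synthetic data (the notebook's pca_data and label columns could be substituted directly):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic stand-in for pca_data and a clustering of it
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # higher is better, in [-1, 1]
ch = calinski_harabasz_score(X, labels)  # higher is better, unbounded
db = davies_bouldin_score(X, labels)     # lower is better, >= 0
print(sil, ch, db)
```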

Conclusion :¶

The performance of all three clustering algorithms was fairly similar on the given dataset. We summarise the conclusions for each algorithm here.

K-Means Clustering :¶

  • We obtained 8 clusters with a silhouette score of 0.429.
  • Out of these 8 clusters, two can be further used for behavioural study.

Gaussian Mixture Model :¶

  • We obtained 2 clusters with a silhouette score of 0.404.
  • Cluster 0 had customers with high monetary value and account balance but low recency.
  • Cluster 1 had customers with low monetary value and account balance but high recency and frequency.

DBSCAN :¶

  • We obtained 3 clusters with a silhouette score of 0.416.
  • The clusters obtained were similar to the two dominating clusters of K-Means.

We conclude that different algorithms give different clusters with varying behaviours; it is then up to the stakeholders to decide which clusters to focus on for further analysis.

Difficulties and Further Improvements¶

  • The original dataset had 10 lakh (1 million) transactions, which posed the difficulty of provisioning resources to manage data at that scale.
  • The data used in this notebook was sampled randomly from the 10 lakh records. This introduces sampling error, which can be analysed further and dealt with.
  • Clustering needs well-defined features for segmentation; calculating the RFM metrics from the sampled dataset was a challenge.
  • The notebook can be improved by using a better sample that represents the full set of transactions, and further analysis of each feature can give better insight into the clusters.
In [ ]: